Method and system for generating acoustic fingerprints

ABSTRACT

A method and system for generating an acoustic fingerprint of a digital audio signal is presented. A received digital audio signal is downsampled, based upon a predetermined frequency, and then subdivided into a beginning portion, a middle portion and an end portion. A plurality of beginning frames, a plurality of middle frames and a plurality of end frames, each having a predetermined number of samples, are extracted from the beginning, middle and end portions of the downsampled, digital audio signal, respectively. A plurality of frame vectors, each having a plurality of spectral residual bands and a plurality of time domain features, are generated from the plurality of beginning, middle and end frames, and an acoustic fingerprint of the digital audio signal is created based on the plurality of frame vectors. The acoustic fingerprint is then stored in a database.

CLAIM FOR PRIORITY/CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application Ser. No. 60/497,328 (filed Aug. 25, 2003), which is incorporated herein by reference in its entirety. This application is related to U.S. Non-provisional patent application Ser. No. 09/931,859 (filed Aug. 20, 2001, now abandoned), which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present invention relates to digital signal processing. More specifically, the present invention relates to a method and system for generating acoustic fingerprints that represent perceptual properties of a digital audio signal.

BACKGROUND OF THE INVENTION

Acoustic fingerprinting has historically been used primarily for signal recognition purposes, including, for example, terrestrial radio monitoring systems. Because these systems monitor continuous audio sources, their acoustic fingerprinting solutions typically accommodated the lack of delimiters between given signals. However, these systems were less concerned with performance, because a particular monitoring system did not need to discriminate between large numbers of signals, and dealt primarily with analog signal distortions. Additionally, these systems do not effectively process many of the common types of signal distortion encountered with compressed digital audio signals, such as normalization, small amounts of time compression and expansion, envelope changes, noise injection, and psychoacoustic compression artifacts.

There have been various attempts to automate audio sequencing, ranging from collaborative filtering and metadata-driven solutions, to human or rules-based classification, to machine-listening systems. These have suffered from various deficiencies, including laborious human classification, large amounts of user preference training data, an inability to handle unknown, unclassified audio, usage of a single description for an entire audio work, etc. None have been able to flexibly index audio from radio, microphone sources, digital libraries, and internet sources in a heterogeneous manner. Additionally, while some have addressed the issue of finding similar works, they are unable to sequence result lists as well, due to a lack of temporal information in the audio description, especially when comparing works of varying lengths.

SUMMARY OF THE INVENTION

Embodiments of the present invention are directed to a method and system for generating an acoustic fingerprint of a digital audio signal. A received digital audio signal is downsampled, based upon a predetermined frequency, and then subdivided into a beginning portion, a middle portion and an end portion. A plurality of beginning frames, a plurality of middle frames and a plurality of end frames, each having a predetermined number of samples, are extracted from the beginning, middle and end portions of the downsampled, digital audio signal, respectively. A plurality of frame vectors, each having a plurality of spectral residual bands and a plurality of time domain features, are generated from the plurality of beginning, middle and end frames, and an acoustic fingerprint of the digital audio signal is created based on the plurality of frame vectors. The acoustic fingerprint is then stored in a database.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a logic flow diagram, showing the basic, batched model of building a reference SoundsLike print database, according to an embodiment of the present invention.

FIG. 2 is a logic flow diagram, giving an overview of the audio stream preprocessing step, according to an embodiment of the present invention.

FIG. 3 is a logic flow diagram, giving more detail of the SoundsLike print generation step, according to an embodiment of the present invention.

FIG. 4 is a logic flow diagram, giving more detail of the time domain feature extraction step, according to an embodiment of the present invention.

FIG. 5 is a logic flow diagram, giving more detail of the spectral domain feature extraction step, according to an embodiment of the present invention.

FIG. 6 is a logic flow diagram, giving more detail of the beat tracking finalization step, according to an embodiment of the present invention.

FIG. 7 is a logic flow diagram, giving more detail of the second stage FFT feature step, according to an embodiment of the present invention.

FIG. 8 is a logic flow diagram, giving more detail of the frame finalization step, including spectral band residual computation, and wavelet residual computation and sorting, according to an embodiment of the present invention.

FIG. 9 is a block diagram that illustrates a system architecture according to an embodiment of the present invention.

FIG. 10 is a block diagram that illustrates the architecture of the SoundsLike print database component, according to an embodiment of the present invention.

FIG. 11 is a logic flow diagram, giving more detail of the SoundsLike print comparison process, according to an embodiment of the present invention.

FIG. 12 is a logic flow diagram, giving more detail of the feature frame comparison function, according to an embodiment of the present invention.

FIG. 13 is a logic flow diagram, showing the SoundsLike print ordering process, according to an embodiment of the present invention.

FIG. 14 is a top level flow diagram that illustrates a method for generating an acoustic fingerprint of a digital audio signal, according to an embodiment of the present invention.

DETAILED DESCRIPTION

FIG. 9 depicts a block diagram that illustrates a system architecture according to an embodiment of the present invention. System 900 may include acoustic fingerprint generation module 910, acoustic fingerprint comparison module 911, and acoustic fingerprint reference database 912. Acoustic fingerprint identification module 913 may also be provided. Acoustic fingerprint generation module 910, acoustic fingerprint comparison module 911 and acoustic fingerprint identification module 913 may be implemented as software components, hardware components or any combination thereof. Generally, system 900 may be coupled to a network. In an embodiment, acoustic fingerprint generation module 910, acoustic fingerprint comparison module 911, acoustic fingerprint reference database 912 and acoustic fingerprint identification module 913 may be individually coupled to a network, or to each other, in various ways (not shown in FIG. 9).

According to various embodiments of the present invention, acoustic fingerprints are created from a digital audio sound stream, which may originate from a digital audio source such as, for example, a compressed or non-compressed audio data file, a CD, a radio broadcast, a microphone, etc. In one embodiment, acoustic fingerprint comparison module 911 and acoustic fingerprint reference database 912 are located on a central network server (not shown in FIG. 9) in order to provide access to multiple, networked users, while in another embodiment, acoustic fingerprint generation module 910, acoustic fingerprint comparison module 911 and acoustic fingerprint reference database 912 reside on the same computer (as generally shown in FIG. 9).

Acoustic fingerprint comparison module 911 may precompute results for each acoustic fingerprint in acoustic fingerprint reference database 912, using one or more weight sets, in order to support quick retrieval of search results on devices with low processing power, such as, for example, portable audio players. Acoustic fingerprint identification module 913 may map a short input (such as a 30 second microphone capture, or a hummed query) to a full, reference acoustic fingerprint.

Acoustic fingerprints may be formed by subdividing a digital audio stream into discrete frames, from which various temporal and spectral features, such as, for example, zero crossing rates, spectral residuals, Haar wavelet residuals, trailing spectral power deltas, etc., may be extracted, summarized, and organized into frame feature vectors. In a preferred embodiment, several constant length frames are extracted from the beginning, middle, and end of a digital acoustic signal and sampled at locations proportionate to the length of the signal. In a further embodiment, the middle frames may be created by averaging one or more constant length feature frames to produce a constant length acoustic fingerprint, which advantageously allows variable-length musical works (i.e., digital audio signals) to be compared while maintaining each work's temporal features, including, for example, transition information. Song reordering, based on acoustic fingerprint comparisons using subsets of frames, as well as overall similarity searching, may be provided.

In one embodiment, acoustic fingerprints are compared by calculating a weighted Manhattan distance between a given pair of acoustic fingerprints. Additionally, comparisons focusing on a subset of frames, such as, for example, comparing the beginning portion of an acoustic fingerprint to the end portions of other acoustic fingerprints, may be used to determine similarity for sequencing, for example. In one embodiment, comparisons are performed on a nearest neighbor set of acoustic fingerprints by acoustic fingerprint comparison module 911, and identifiers are then associated with each element of acoustic fingerprint reference database 912. Acoustic fingerprint comparison module 911 may provide the appropriate identifiers when a set of similar acoustic fingerprints is found.
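
A minimal sketch of this comparison, assuming each acoustic fingerprint is a flat numeric feature vector and each weight bank is a per-feature array of scaling factors (the vector layout is an illustrative assumption, not the exact fingerprint structure), follows:

    import numpy as np

    def weighted_manhattan_distance(fp_a, fp_b, weights):
        # Weighted Manhattan (L1) distance between two acoustic
        # fingerprints; smaller values indicate greater similarity.
        return float(np.sum(weights * np.abs(fp_a - fp_b)))

    # Example: compare two hypothetical 16-feature fingerprints
    # using a uniform weight bank.
    fp_a, fp_b = np.random.rand(16), np.random.rand(16)
    print(weighted_manhattan_distance(fp_a, fp_b, np.ones(16)))

Comparing only the end frames of one fingerprint against the beginning frames of another then reduces to slicing the frame vectors before applying the same distance.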

In a preferred embodiment, a similarity query is performed in response to the activation of a button on a digital audio playback device, or in a graphical interface of the device, such as, for example, a “SoundsLike” button on a portable digital audio player. The similarity query may include, for example, the currently playing song, the currently selected song in a browser, etc., and may be directed to a local acoustic fingerprint reference database residing on the digital audio playback device, or, alternatively, to a remote acoustic fingerprint database residing on a network server, such as, for example, acoustic fingerprint reference database 912. Additionally, the results returned by the similarity query, i.e., the matching acoustic fingerprints, may be sequenced to create a music playlist for the digital audio playback device.

In one embodiment, acoustic fingerprint generation module 910 may reside within a database system, a media playback tool, a portable audio unit, etc. Upon receiving unknown content, acoustic fingerprint generation module 910 generates an acoustic fingerprint, which may be sent to acoustic fingerprint comparison module 911 over a network, for example. Acoustic fingerprint generation may also occur at synchronization time, such as, for example, when a portable audio player is “docked” with a host PC, and acoustic fingerprints may be generated from each digital audio file as it is transmitted from the host PC to the portable audio player.

FIG. 14 is a top level flow diagram that illustrates a method for generating an acoustic fingerprint of a digital audio signal, according to an embodiment of the present invention.

Processing a media data file (i.e., digital audio signal) may include opening the file, identifying the file format, and, if appropriate, decompressing the file. The decompressed digital audio data stream may then be scanned for a DC offset error, and if one is detected, the offset may be removed. Following the DC offset correction, the digital audio data stream may be downsampled to 11,025 Hz, which also provides low pass filtering of the high frequency component of the digital audio signal. In an embodiment, the downsampled digital audio data stream is downmixed to a mono stream. This step advantageously speeds up extraction of acoustic features and eliminates high frequency noise components introduced by compression, radio broadcast, environmental noise, etc. In one embodiment, acoustic fingerprint generation module 910 processes the file directly, while in another embodiment, the downsampled, downmixed digital audio signal is processed by a media data file preprocessing module (not shown in FIG. 9), and then transmitted to acoustic fingerprint generation module 910. Other digital audio sources may be subjected to similar initial processing.
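
A sketch of this preprocessing chain, assuming a SciPy polyphase resampler and a (channels, n) sample layout (both illustrative choices, not requirements of the embodiment), follows:

    import numpy as np
    from scipy.signal import resample_poly

    TARGET_RATE = 11025  # predetermined downsampling frequency

    def preprocess(samples, source_rate):
        # Downmix to mono, remove any DC offset, and downsample to
        # 11,025 Hz; resample_poly applies an anti-aliasing low-pass
        # filter, providing the low pass filtering of the high
        # frequency component.
        x = np.asarray(samples, dtype=np.float64)
        if x.ndim == 2:          # (channels, n) -> mono
            x = x.mean(axis=0)
        x = x - x.mean()         # DC offset removal
        return resample_poly(x, TARGET_RATE, source_rate)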

Acoustic fingerprints may be formed by subdividing (1411) a digital audio stream into a beginning portion, a middle portion and an end portion. In one embodiment, a window frame size of 96,000 samples may be used, with a frame overlap percentage of 0%. Extracting (1412), or sampling, 5 frames from the beginning portion of the digital audio signal, 3 frames from the midpoint of the digital audio signal, and 5 frames from the end of the digital audio signal provides a very effective frame vector creation method. In cases where the temporal length of the digital audio signal is less than the time required to generate an acoustic fingerprint without frame overlap, front, middle, and end frames may be overlapped. Alternatively, when the temporal length of the digital audio signal is less than the time required for front, middle and end frame sets, the middle and end frame sets may be omitted, and only a proportionate number of front frames may be extracted. In the embodiment including a window frame size of 96,000 samples and a sampling rate of 11,025 Hz, a minimum digital audio signal length of approximately 9 seconds is required to generate a single frame. This frame methodology may be optimized for music, and modification of frame size and frame count may be performed to accommodate smaller digital audio signals, such as, for example, sound effects.
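
The 5/3/5 frame layout may be sketched as follows; the overlap handling for short signals and the exact proportional placement are omitted, and the signal is assumed long enough for all thirteen frames:

    import numpy as np

    FRAME_SIZE = 96000
    N_BEGIN, N_MIDDLE, N_END = 5, 3, 5

    def extract_frames(signal):
        # Five contiguous frames from the beginning, three centered
        # on the midpoint, and five ending at the final sample.
        n = len(signal)
        begin = [signal[i * FRAME_SIZE:(i + 1) * FRAME_SIZE]
                 for i in range(N_BEGIN)]
        mid0 = n // 2 - (N_MIDDLE * FRAME_SIZE) // 2
        middle = [signal[mid0 + i * FRAME_SIZE:mid0 + (i + 1) * FRAME_SIZE]
                  for i in range(N_MIDDLE)]
        end = [signal[n - (N_END - i) * FRAME_SIZE:
                      n - (N_END - i - 1) * FRAME_SIZE]
               for i in range(N_END)]
        return begin, middle, end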

In another embodiment, the middle frames may be extracted from all of the digital audio available in the middle of the digital audio signal. Continuous feature frames may be extracted, starting from the end of the beginning frame set and ending at the beginning of the end frame set. The total number of continuous frames may then be divided by a constant, and the result is used to determine how many frames are averaged together to create an averaged middle frame. For example, given 3 desired middle frames and 72 seconds of middle portion digital audio, 9 frames would be initially extracted and averaged together, in groups of 3 frames, to create the desired 3 middle frames. Advantageously, averaging the middle portion of the digital audio signal provides a better representation of the middle portion of a musical work, although at a higher computational cost for acoustic fingerprint creation.
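
A sketch of this averaging, operating on already-extracted continuous middle feature frames (each a fixed-length numeric vector); the handling of frame counts that do not divide evenly is an assumption:

    import numpy as np

    def average_middle_frames(middle_frames, n_desired=3):
        # Split the continuous middle feature frames into n_desired
        # groups and average each group into one representative
        # frame, e.g. 9 frames -> 3 averaged frames in groups of 3.
        frames = np.asarray(middle_frames, dtype=np.float64)
        group = len(frames) // n_desired
        used = frames[:group * n_desired]  # drop any remainder
        return used.reshape(n_desired, group, -1).mean(axis=1)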

Generally, a plurality of frame vectors is generated (1413) from the plurality of beginning, middle and end frames, and the acoustic fingerprint of the digital audio signal is created (1414) from these frame vectors. The acoustic fingerprint may then be stored (1415) in a database, such as, for example, acoustic fingerprint reference database 912. A more detailed description of the generation of the frame vectors follows, with reference to FIGS. 3 through 8.

FIGS. 3 through 8 are top level flow diagrams that illustrate methods for generating an acoustic fingerprint of a digital audio signal, according to embodiments of the present invention.

In an embodiment, the window frame size samples are advanced into a working buffer (313). The time domain features of the working frame vector are then computed (314). The zero crossing rate is computed by storing the sign of the previous sample, and incrementing a counter each time the sign of the current sample is not equal to the sign of the previous sample, with zero samples ignored. The zero crossing total is then divided by the frame window length to compute the zero crossing mean feature. The absolute value of each sample is also summed into a temporary variable, which is likewise divided by the frame window length to compute the sample mean value. This result is divided by the root-mean-square of the samples in the frame window to compute the mean/RMS ratio feature. Additionally, the mean energy value is stored for each block of 10,624 samples within the frame. The absolute value of the difference from block to block is then averaged to compute the mean energy delta feature.
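
These time domain computations may be sketched as follows ("mean energy" is taken here as the mean squared amplitude per block, an assumption):

    import numpy as np

    BLOCK_SIZE = 10624

    def time_domain_features(frame):
        x = np.asarray(frame, dtype=np.float64)
        n = len(x)
        # Zero crossings: count sign changes, ignoring zero samples,
        # then divide by the frame window length.
        signs = np.sign(x)
        signs = signs[signs != 0]
        zero_crossing_mean = np.count_nonzero(np.diff(signs)) / n
        # Mean absolute sample value divided by the frame RMS.
        sample_mean = np.abs(x).sum() / n
        rms = np.sqrt(np.mean(x ** 2))
        mean_rms_ratio = sample_mean / rms if rms > 0 else 0.0
        # Mean energy per 10,624-sample block, then the mean
        # absolute block-to-block difference.
        n_blocks = n // BLOCK_SIZE
        blocks = x[:n_blocks * BLOCK_SIZE].reshape(n_blocks, BLOCK_SIZE)
        energy = np.mean(blocks ** 2, axis=1)
        mean_energy_delta = np.mean(np.abs(np.diff(energy)))
        return zero_crossing_mean, mean_rms_ratio, mean_energy_delta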

Next, a wavelet transform, such as, for example, a Haar wavelet transform with a transform size of 64 samples, using, for example, ½ for the high pass and low pass components of the transform, is applied (315) to the frame audio samples. Each transform may be overlapped by 50%, and the resulting coefficients are summed into a 64 point array. Each point in the array is then divided by the number of transforms performed, and the minimum array value is stored as the normalization value. The absolute value of each array value minus the normalization value is then stored in the array, any values less than 1 are set to 0, and the final array values are converted to log space using the equation array[I] = 20*log10(array[I]). These log scaled values are then sorted (321, detail FIG. 8) into ascending order, to create a wavelet domain feature bank.
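
One way to sketch this feature bank is shown below; a single-level Haar split is assumed (the description does not fix the transform depth), with ½ high- and low-pass coefficients and 50% overlap:

    import numpy as np

    WSIZE, HOP = 64, 32  # 64-sample transform, 50% overlap

    def haar_block(block):
        # One-level Haar transform with 1/2 coefficients: 32
        # low-pass and 32 high-pass values per 64-sample block.
        lo = (block[0::2] + block[1::2]) / 2.0
        hi = (block[0::2] - block[1::2]) / 2.0
        return np.concatenate([lo, hi])

    def wavelet_feature_bank(frame):
        x = np.asarray(frame, dtype=np.float64)
        acc, count = np.zeros(WSIZE), 0
        for s in range(0, len(x) - WSIZE + 1, HOP):
            acc += haar_block(x[s:s + WSIZE])
            count += 1
        acc /= count                   # divide by number of transforms
        acc = np.abs(acc - acc.min())  # subtract normalization value
        acc[acc < 1.0] = 0.0           # zero any values less than 1
        # array[I] = 20*log10(array[I]); zeroed entries stay at 0,
        # avoiding the log of zero (log10(1) = 0).
        out = np.where(acc > 0, 20.0 * np.log10(np.maximum(acc, 1.0)), 0.0)
        return np.sort(out)            # ascending wavelet feature bank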

Subsequent to the wavelet computation, a window of 64 samples in length, such as, for example, a Blackman-Harris window, is applied (317), and a Fast Fourier transform is applied (318). The resulting power bands are summed in a 32 point array, converted (319) to a log scale using the equation spec[I] = log10(spec[I]/4096) + 6, and then the difference from the previous transform is summed in a companion spectral band delta array of 32 points. This is repeated, with a 50% overlap between each transform, across the entire frame window. Additionally, after each transform is converted to log scale, the sum of the second and third bands, times 5, is stored in an array (e.g., “beatStore”), indexed (detail FIG. 6) by the transform number.
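
A sketch of this spectral pass follows; band numbering is assumed one-based in the description, so the "second and third bands" are indices 1 and 2 here, and the signed band delta is an assumption:

    import numpy as np
    from scipy.signal.windows import blackmanharris

    FFT_SIZE, HOP, N_BANDS = 64, 32, 32

    def spectral_features(frame):
        x = np.asarray(frame, dtype=np.float64)
        win = blackmanharris(FFT_SIZE)
        band_sum = np.zeros(N_BANDS)
        delta_sum = np.zeros(N_BANDS)
        prev, beat_store = None, []
        for s in range(0, len(x) - FFT_SIZE + 1, HOP):
            power = np.abs(np.fft.rfft(x[s:s + FFT_SIZE] * win)) ** 2
            # spec[I] = log10(spec[I]/4096) + 6
            bands = np.log10(np.maximum(power[:N_BANDS], 1e-12) / 4096.0) + 6.0
            band_sum += bands
            if prev is not None:
                delta_sum += bands - prev   # companion delta array
            prev = bands
            beat_store.append(5.0 * (bands[1] + bands[2]))
        return band_sum, delta_sum, np.array(beat_store)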

After the other features have been extracted, a two-stage Fourier transform may then be applied (320). The first stage transform is performed on a 512 point unwindowed sample block across the entire frame window, with an 85% overlap between each transform. Alternatively, a Blackman-Harris window may be used. The third power band of each first stage Fourier transform may be stored in a queue structure limited, for example, to 512 elements. Once the queue structure is full with 512 elements (i.e., in this embodiment, every 44 first stage transforms), the second stage Fourier transform is performed on the 512 output data points of the first stage transform. The first 32 power bands of the second stage transform are summed in an array (e.g., “f2Spec”). After the last first stage Fourier transform, the array is divided by the number of second stage transforms to produce the mean average. Selection of different first stage bands for input to the second stage process is also possible, and the usage of a wavelet or DCT transform to summarize the second stage is also contemplated.
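
This two-stage process may be sketched as follows; the queue is flushed after each second-stage transform here (a simplification of the queue behavior), and the hop size is derived from the stated 85% overlap:

    import numpy as np

    STAGE1_SIZE, QUEUE_LEN, N_OUT = 512, 512, 32
    HOP = int(STAGE1_SIZE * 0.15)  # 85% overlap -> 76-sample hop

    def second_stage_features(frame):
        x = np.asarray(frame, dtype=np.float64)
        queue, f2_spec, n_stage2 = [], np.zeros(N_OUT), 0
        for s in range(0, len(x) - STAGE1_SIZE + 1, HOP):
            power = np.abs(np.fft.rfft(x[s:s + STAGE1_SIZE])) ** 2
            queue.append(power[2])  # third power band (0-based index 2)
            if len(queue) == QUEUE_LEN:
                # Second-stage FFT over the 512 queued band values,
                # summing its first 32 power bands into f2Spec.
                spec2 = np.abs(np.fft.rfft(np.array(queue))) ** 2
                f2_spec += spec2[:N_OUT]
                n_stage2 += 1
                queue.clear()
        # Divide by the number of second stage transforms to produce
        # the mean average.
        return f2_spec / n_stage2 if n_stage2 else f2_spec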

After the calculation of the last Fourier transform, the indexed array (e.g., “beatStore”) may be processed using a beat tracking algorithm. The minimum value in the array is found, and each array value is adjusted such that array[I] = array[I] − minimum value. Then, the maximum value in the array is found, and a constant (e.g., “beatmax”) is defined to be 80% of the maximum value in the array. For each value in the array which is greater than the constant, if all the array values within ±4 array slots are less than the current value, and it has been more than 14 slots since the last detected beat, a beat is detected and the beats per minute, or BPM, feature is determined (FIG. 6). More precise beat tracking methods may also be utilized.
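
A sketch of this beat-tracking pass over the "beatStore" array (the conversion from detected beats to BPM over the frame duration is an assumption):

    import numpy as np

    def beats_per_minute(beat_store, frame_duration_sec):
        # e.g. frame_duration_sec = 96000 / 11025, about 8.7 s.
        a = np.asarray(beat_store, dtype=np.float64)
        a -= a.min()                  # array[I] = array[I] - minimum
        beat_max = 0.8 * a.max()      # the "beatmax" constant
        beats, last = [], -15
        for i, v in enumerate(a):
            if v <= beat_max or i - last <= 14:
                continue              # below threshold or too soon
            lo, hi = max(0, i - 4), min(len(a), i + 5)
            neighbors = np.concatenate([a[lo:i], a[i + 1:hi]])
            if np.all(neighbors < v): # dominates +/-4 array slots
                beats.append(i)
                last = i
        return len(beats) * 60.0 / frame_duration_sec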

Upon completing the spectral domain calculations, the frame finalization process may be performed and the acoustic fingerprint created (321). First, the spectral power band means are converted (812) to spectral residual bands by finding the minimum spectral band mean and subtracting it from each spectral band mean. Next, the sum of the spectral residuals may be stored as the spectral residual sum feature. Finally, depending on the aggregation type, the acoustic fingerprint, consisting of the spectral residuals, the spectral deltas, the sorted wavelet residuals, the beat feature, the mean/RMS ratio, the zero crossing rate, and the mean energy delta feature, may be stored (818).
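
The residual conversion itself is a minimum subtraction, sketched below:

    import numpy as np

    def spectral_residuals(band_means):
        # Subtract the minimum spectral band mean from each band
        # mean; the sum of the residuals becomes the spectral
        # residual sum feature.
        bands = np.asarray(band_means, dtype=np.float64)
        residuals = bands - bands.min()
        return residuals, residuals.sum()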

In a preferred embodiment, acoustic fingerprint comparison module 911 may reside within a music management application, such as synchronization software for a portable music player. In this embodiment, the media file contains the digital audio signal. Upon receiving the new acoustic fingerprint from acoustic fingerprint generation module 910, the acoustic fingerprint may be associated with a media key specific to the media data file from which the acoustic fingerprint was extracted. Alternatively, a check may be performed to determine whether the acoustic fingerprint is a duplicate, e.g., identical, within a particular similarity threshold, etc., of any existing acoustic fingerprints in the associated fingerprint database, such as, for example, acoustic fingerprint reference database 912. Depending on memory and response time requirements, the nearest neighbor set for the new acoustic fingerprint may be calculated using one or more weight banks and acoustic fingerprint reference database 912. This precomputed, nearest neighbor set may then be stored in acoustic fingerprint reference database 912, along with the new acoustic fingerprint and media identifier.

In one embodiment, after generating acoustic fingerprints and optionally precomputing nearest neighbor sets for each media file that has been added to the management application, or is pending synchronization to the media player, acoustic fingerprint reference database 912 may be uploaded to the media player. This allows the more computationally expensive generation and comparison processes to be performed on the faster host PC, leaving only query operations on the portable device.

A query (e.g., a “SoundsLike” query) may take several forms, depending upon the host device and audio type. In the case of a portable audio player, a button may be pressed when any track is selected in the browse listing, or when a track (i.e., a digital audio signal) is currently being played back. Upon depression of the “SoundsLike” button, the associated media ID for the currently selected, or currently playing, media file is retrieved and passed to a “SoundsLike” database module on the device. If no nearest neighbor set has been precomputed, the acoustic fingerprint database (e.g., acoustic fingerprint reference database 912) may be loaded and the currently selected weight bank may be used to find the closest acoustic fingerprints to the acoustic fingerprint associated with the query media ID. Alternatively, if the nearest neighbor set has been precomputed, an index may be used to jump directly to the precomputed set of media IDs that are most similar in the current weight set to the query media ID. This set is then returned to the media player, which proceeds to create a playlist from the associated media files for each media ID.

If the portable audio device is receiving an unindexed digital audio signal, such as, for example, a radio, microphone, internet stream, line-in source, etc., then an acoustic fingerprint may be created from the input digital audio stream, preferably using 13 window frames of digital audio samples for the acoustic fingerprint, as discussed above. This acoustic fingerprint may then be added to acoustic fingerprint reference database 912, and a query can then be performed. In this embodiment, acoustic fingerprint generation module 910 and acoustic fingerprint comparison module 911 both reside on the portable audio device (as software components, for example). This allows a device to integrate any source of digital audio into the query process for a user, such as seeding a playlist from a user's personal audio collection based on a song they hear on the radio, or in a club.

In the event that the input digital audio source contains insufficient material to generate an acceptable acoustic fingerprint, in one embodiment, acoustic fingerprint identification module 913 may map the input digital audio signal to a known acoustic fingerprint, while in another embodiment, acoustic fingerprint identification module 913 may interpret a melodic pattern from the input digital audio signal (e.g., a hummed tune). In both embodiments, the resulting identifier returned by acoustic fingerprint identification module 913 may be used to retrieve a reference acoustic fingerprint stored in acoustic fingerprint reference database 912.

In a further embodiment, a graphical user interface may be provided to allow the user of system 900 to select a weight bank to tune the system in different fashions. For instance, one weight bank may weight the lower frequency features, such as the first few second stage FFT features and the beat feature, higher than the vocal range features, in order to focus a search on the tempo and rhythm characteristics in the fingerprint, while another may weight the features more evenly for a blended search that takes vocals, instrumentation, and rhythm into account. Additionally, a slider graphical interface, similar to a graphics equalizer, may be presented to the user to allow manual control over the weight banks. In this embodiment, each slider may be associated with one or more features to manually tune acoustic fingerprint comparisons.

In another embodiment, a “more like this”/“less like this” feature may be provided, in which acoustic fingerprint comparison module 911 receives and processes two acoustically fingerprinted tracks and shifts the current weight bank to reduce the weight of dissimilar features in the selected acoustic fingerprints and raise the weight of similar features, as appropriate. This feature advantageously provides an intuitive mechanism for a non-technical user to further train acoustic fingerprint comparison module 911 to the user's individual tastes. Additional methods of weight adjustment, including, for example, allowing a user to select multiple acoustic fingerprints, training a weight set via a Bayesian filter or neural network, etc., are also contemplated by the present invention.
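
One possible sketch of this weight shift follows; the split between "similar" and "dissimilar" features via the median feature difference, and the scaling factor, are assumptions for illustration:

    import numpy as np

    def retrain_weights(weights, fp_a, fp_b, more_similar=True, scale=1.1):
        # Features where the two selected fingerprints agree are
        # boosted and disagreeing features reduced (reversed for
        # "less like this"), then the weights are renormalized.
        diff = np.abs(np.asarray(fp_a) - np.asarray(fp_b))
        similar = diff < np.median(diff)  # assumed similarity split
        w = np.asarray(weights, dtype=np.float64).copy()
        up, down = (scale, 1.0 / scale) if more_similar else (1.0 / scale, scale)
        w[similar] *= up
        w[~similar] *= down
        return w / w.sum()                # normalize modified weights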

In a further embodiment, a sorting method may be used on nearest neighbor sets to create a playlist, including, for example, a random sort, sorting by similarity, a merge sort from two or more queries, a random merge from two or more queries, a thresholded merge from two or more queries (where the similarity factor for each duplicate item in the merged sets is summed for each item which exists in more than one query set, and items below a certain threshold are removed from the final list), an acoustic fingerprint-based sort, etc. In the acoustic fingerprint-based sort, for example, a special comparison may be performed between the acoustic fingerprints within the result set, where the first and last sets of feature vectors in each acoustic fingerprint are compared to all of the other acoustic fingerprints in the result set, with the resulting sort order based on the minimization of the weighted error between the first and last part of each acoustic fingerprint. This sort may include selecting a seed track, and for each of the other acoustic fingerprints, finding the acoustic fingerprint with the smallest error, and then repeating the process until each acoustic fingerprint has been moved into the result list. In yet another embodiment, additional metadata, such as genre or album, or perceptual metadata, such as emotional or sonic descriptors, may be used as a final filter on the result set.
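
The acoustic fingerprint-based sort may be sketched as a greedy chain, assuming each fingerprint is an array of shape (n_frames, n_features) and a per-feature weight vector:

    import numpy as np

    def fingerprint_sort(fingerprints, seed_idx, weights, edge=5):
        # Starting from the seed, repeatedly append the remaining
        # fingerprint whose beginning frames have the smallest
        # weighted error against the end frames of the last track.
        remaining = set(range(len(fingerprints)))
        order = [seed_idx]
        remaining.remove(seed_idx)
        while remaining:
            tail = fingerprints[order[-1]][-edge:]
            best = min(remaining, key=lambda j: np.sum(
                weights * np.abs(fingerprints[j][:edge] - tail)))
            order.append(best)
            remaining.remove(best)
        return order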

Generally, the above-described systems and methods may be implemented on a computer server, a personal computer, in a distributed processing environment, or the like, or on a separate programmed general purpose computer having database management and user interface capabilities. Additionally, the systems and methods of this invention may be implemented on a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit element(s), an ASIC or other integrated circuit, a digital signal processor, a hard-wired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, PAL, or the like, or a neural network and/or through the use of fuzzy logic. In general, any device capable of implementing a state machine that is in turn capable of implementing the flowcharts illustrated herein may be used to implement the invention.

Furthermore, the disclosed methods may be readily implemented in software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed system may be implemented partially or fully in hardware using standard logic circuits or a VLSI design. Whether software or hardware is used to implement the systems in accordance with this invention depends on the speed and/or efficiency requirements of the system, the particular function, and the particular software or hardware systems or microprocessor or microcomputer systems being utilized. The systems and methods illustrated herein, however, can be readily implemented in hardware and/or software using any known or later developed systems or structures, devices and/or software by those of ordinary skill in the applicable art from the functional description provided herein and with a general basic knowledge of the computer and data processing arts.

Moreover, the disclosed methods may be readily implemented in software executed on a programmed general purpose computer, a special purpose computer, a microprocessor, or the like. Thus, the systems and methods of this invention can be implemented as a program embedded on a personal computer, such as a JAVA® or CGI script, as a resource residing on a server or graphics workstation, as a routine embedded in a dedicated system, or the like. The system can also be implemented by physically incorporating the system and method into a software and/or hardware system, such as the hardware and software systems described herein.

While this invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the preferred embodiments of the invention, as set forth herein, are intended to be illustrative. Various changes may be made without departing from the true spirit and full scope of the invention as set forth herein.

CLAIMS

1. A method for generating an acoustic fingerprint of a digital audio signal, comprising: downsampling a received digital audio signal based upon a predetermined frequency; subdividing the downsampled, digital audio signal into a beginning portion, a middle portion and an end portion; extracting a plurality of beginning frames, a plurality of middle frames and a plurality of end frames from the beginning, middle and end portions of the downsampled, digital audio signal, respectively, each frame having a predetermined number of samples; generating a plurality of frame vectors from the plurality of beginning, middle and end frames, each frame vector including a plurality of acoustic features; creating an acoustic fingerprint of the digital audio signal based on the plurality of frame vectors; and storing the acoustic fingerprint in a database.

2. The method according to claim 1, wherein said generating a frame vector for each frame includes: computing a plurality of time domain features from the predetermined number of samples within the frame; computing a plurality of spectral domain features from the predetermined number of samples within the frame; computing a plurality of wavelet domain features from the predetermined number of samples; computing a plurality of second stage spectral features from the predetermined spectral domain FFT results; and creating the frame vector.

3. The method according to claim 2, wherein said generating a frame vector for each frame includes: applying a logarithmic conversion to the plurality of spectral power bands; creating an indexed array based on the plurality of log-converted spectral power bands; determining a number of beats within the indexed array; and including the number of beats within the frame vector.

4. The method according to claim 2, wherein the wavelet domain features are computed using a Haar wavelet transform and a Blackman-Harris window.

5. The method according to claim 1, further comprising: downmixing the downsampled audio signal to create a single channel, downsampled digital audio signal.

6. The method according to claim 2, wherein the predetermined frequency is about 11,025 Hz.

7. The method according to claim 2, wherein: the predetermined number of samples is about 96,000; the plurality of beginning frames includes five frames; the plurality of middle frames includes three frames; and the plurality of end frames includes five frames.

8. The method according to claim 1, wherein said extracting a plurality of middle frames includes: determining a total number of frames within the plurality of middle frames; calculating a number of frames to average by dividing the total number of frames by a constant; and averaging the plurality of middle frames, based on the number of frames to average, to create the constant number of frames.

9. The method according to claim 1, wherein the plurality of time domain features include a zero crossing rate, a zero crossing mean, a sample mean and RMS ratio, a mean energy value, and a mean energy delta value.

10. A method for generating an acoustic fingerprint frame vector from a frame extracted from a digital audio signal, comprising: computing a plurality of time domain features from a plurality of samples within the frame; applying a window function to the plurality of samples; applying a Fast Fourier Transform to the plurality of windowed samples to create a plurality of spectral power bands; determining the number of beats from the spectral power bands; selecting one or more output spectral power bands and using one or more first stage FFT outputs as input for a second Fast Fourier Transform; selecting one or more output second stage power bands, summing across all output second stage Fast Fourier Transforms, and normalizing the resulting sum by the number of input Transforms; creating an acoustic fingerprint frame vector including the plurality of second stage normalized bands, the plurality of time domain features and the number of beats; and storing the acoustic fingerprint frame vector in a memory.

11. The method according to claim 10, wherein the plurality of time domain features include a zero crossing rate, a zero crossing mean, a sample mean and RMS ratio, a mean energy value, and a mean energy delta value.

12. The method according to claim 10, wherein the wavelet domain features are computed using a Haar wavelet transform and a Blackman-Harris window.

13. The method according to claim 10, wherein the plurality of samples consists of about 96,000 samples.

14. An information storage medium storing information operable to perform the method of any of the preceding claims.

15. A system as substantially herein described.

16. A system for generating an acoustic fingerprint of a digital audio signal, comprising: means for downsampling a received digital audio signal based upon a predetermined frequency; means for subdividing the downsampled, digital audio signal into a beginning portion, a middle portion and an end portion; means for extracting a plurality of beginning frames, a plurality of middle frames and a plurality of end frames from the beginning, middle and end portions of the downsampled, digital audio signal, respectively, each frame having a predetermined number of samples; means for generating a plurality of frame vectors from the plurality of beginning, middle and end frames, each frame vector including a plurality of spectral residual bands and a plurality of time domain features; means for creating an acoustic fingerprint of the digital audio signal based on the plurality of frame vectors; and means for storing the acoustic fingerprint in a database.

17. The system according to claim 16, wherein said means for generating a frame vector for each frame includes: means for computing a plurality of time domain features from a plurality of samples within the frame; means for applying a window function to the plurality of samples; means for applying a Fast Fourier Transform to the plurality of windowed samples to create a plurality of spectral power bands; means for determining the number of beats from the spectral power bands; means for selecting one or more output spectral power bands, and using one or more first stage FFT outputs as input for a second Fast Fourier Transform; means for selecting one or more output second stage power bands, summing across all output second stage Fast Fourier Transforms, and normalizing the resulting sum by the number of input Transforms; means for creating an acoustic fingerprint frame vector including the plurality of second stage normalized bands, the plurality of time domain features and the number of beats; and means for creating the frame vector.

18. The system according to claim 17, wherein said means for generating a frame vector for each frame includes: means for applying a logarithmic conversion to the plurality of spectral power bands; means for creating an indexed array based on the plurality of log-converted spectral power bands; means for determining a number of beats within the indexed array; and means for including the number of beats within the frame vector.

19. The system according to claim 17, wherein the wavelet domain features are computed using a Haar wavelet transform and a Blackman-Harris window.

20. The system according to claim 17, wherein the predetermined number of samples consists of about 96,000 samples.

21. A method of sequencing digital media playback, comprising: receiving a plurality of acoustic fingerprints as the seed; selecting a weight bank for comparing the seed acoustic fingerprints; comparing the seed fingerprint with a plurality of reference fingerprints using the selected weight bank; selecting a subset of the reference fingerprints based on their similarity with the seed fingerprint; applying a sort mechanism to the resultant subset; and sequencing digital media playback using the resultant sorted subset.

22. The method according to claim 21, wherein said selecting a weight bank includes: comparing the seed fingerprint with a plurality of weight class reference vectors; and selecting the weight class vector which is most similar to the seed fingerprint.

23. The method according to claim 21, wherein applying a sort mechanism includes: randomly selecting a start acoustic fingerprint from the result set and moving it to the final sorted set; computing the similarity between the last acoustic fingerprint in the sorted set and each remaining acoustic fingerprint in the result set; moving the acoustic fingerprint with the highest similarity into the final sorted set; and repeating until all acoustic fingerprints have been moved into the final sorted set.

24. The method according to claim 21, wherein applying a sort mechanism includes: randomly selecting an acoustic fingerprint from the result set and moving it to the final sorted set; and repeating until all acoustic fingerprints have been moved into the final sorted set.

25. The method according to claim 21, wherein sequencing digital media playback includes: mapping each result acoustic fingerprint to a media identifier; mapping each media identifier to a digital media element; and generating a playlist containing the sorted digital media elements.

26. The method according to claim 21, wherein selecting a weight bank additionally adds the means to retrain a weight bank, which includes: providing a display component wherein a plurality of slider elements are linked to one or more features within the selected weight bank.

27. The method according to claim 21, wherein selecting a weight bank additionally adds the means to retrain a weight bank, which includes: providing a user interface to allow a plurality of fingerprints to be marked as more similar; comparing said plurality of fingerprints, raising the weight of similar features by a scaling factor, and reducing the weight of dissimilar features by said scaling factor; and normalizing the modified weights by said scaling factor.

28. The method according to claim 21, wherein selecting a weight bank additionally adds the means to retrain a weight bank, which includes: providing a user interface to allow a plurality of fingerprints to be marked as less similar; comparing said plurality of fingerprints, lowering the weight of similar features by a scaling factor, and raising the weight of dissimilar features by said scaling factor; and normalizing the modified weights by said scaling factor.

29. The method according to claim 21, wherein receiving a plurality of acoustic fingerprints as the seed includes: generating an identification acoustic fingerprint from an input digital audio source; resolving the identification acoustic fingerprint using a reference acoustic fingerprint database to return a sequencing acoustic fingerprint identifier; and retrieving a reference sequencing acoustic fingerprint from a reference database using said sequencing acoustic fingerprint identifier.